Log compaction failure error and delete temporarily blocks from disk #2261
Signed-off-by: Marco Pracucci <marco@pracucci.com>
I agree that we shouldn't keep the source blocks of failed compactions forever. But if they always get deleted, it becomes impossible to analyze them, and we may later miss having them when some compaction fails and we don't know why. Would it make sense to add a runtime config flag that optionally stops the compactor from deleting the source blocks of failed compactions for a specific tenant? Then, if we need to investigate failing compactions, we could enable that flag for the affected tenant. I'm not sure anymore, but I thought we had encountered situations in the past where being able to analyze the source blocks of failed compactions helped us understand an issue. Or am I misremembering that?
Source blocks are still available in the bucket.
As @pstibrany mentioned, you can download the source blocks from object storage at any time. I don't think having the blocks on the container disk is very useful anyway: to debug them you have to download them to a workstation that has all your debugging tooling, so why not download them directly from object storage (which is also faster)?
LGTM, nice catch with unlogged error.
LGTM, thanks!
…rafana#2261)
* Log compaction failure error and delete temporarily blocks from disk
Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Well, we have to always delete local dir
Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Fix unit tests
Signed-off-by: Marco Pracucci <marco@pracucci.com>
What this PR does
I've seen a compactor fail to compact some blocks, but the reason is missing from the logs because we never log the actual error (the error returned by runCompactionJob() is never logged).
Once a compaction fails, its source blocks are left on disk for later investigation. While this is nice to have, it triggers another issue: subsequent executions of other jobs may run out of disk space, because the blocks of previously failed compaction jobs are still on disk. This is also happening on the cluster I'm investigating.
In this PR I propose to fix both.
Which issue(s) this PR fixes or relates to
N/A
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]